1. Acknowledge and refine your attitude.
It is important to explicitly acknowledge one’s initial attitude at the outset of a project, because this attitude determines one’s preconceived tendencies in terms of:
- the perceived phenomenon of study, including the types of questions one asks, experiences one examines, or stimuli one creates;
- the perceived context of the phenomenon, including that which surrounds its experience, interpretation, and the application of its results;
- the types and aspects of potential data that one perceives to be focally important, and therefore attends to for greater periods of time;
- the methods that one employs to analyze, essentialize, or evaluate one’s data;
- the types and classes or states of variables that one selects to examine;
- the natural language sample one acquires to represent or reveal the experience of the phenomenon.
Before turning to the specific actions involved in selecting an attitude, we note that the attitude that Raven’s Eye employs is described in the General tendencies and attitudes section of these Technicals. It is this attitude that both facilitates your research, and enables you to avoid some of the attitude-borne assumptions and limitations programmed into other natural language processing and qualitative data analysis programs.
The actions involved in acknowledging and refining an attitude now follow.
1.1 Select or construct a phenomenon.
1.2 Determine your lens.
1.3 Determine your focus.
1.4 Determine your procedures and adjunct analyses.
1.5 Select your variables and their respective classes or states.
1.6 Define and acquire your natural language sample.
1.6.1 Sampling for generalization.
If you are operating from a quantitative or logical positivistic perspective in which generalization is of concern, you may use random and representative sampling procedures to select your sample, and then make scientifically sound claims about your confidence in the ability to generalize your results to a given population and stimulus. Calculating the relationship between sample size and confidence in generalization does, however, involve a different procedure than is typically found in most social science research methods handbooks.
This is because naturally produced language yields word proportions that are neither normally distributed nor independent. Fortuitously, when combined with the results produced by our algorithm, the systematically predictable nature of word relationships often means that fewer cases (or participants) are required for scientifically sound generalization than typical statistical sample size calculations would otherwise demand.
Our results include an Overrepresentation score for each word in a given dataset. This score is depicted along the vertical axis in the main chart, and also presented in the rightmost column of the main table. As described in the Understanding your results page of our Practicals, this score is the ratio of each word's proportion in the response set (or column) to that same word's proportion in your selected language corpus. For instance, an Overrepresentation score of 10 means that the word is found in the responses or cases at 10 times its typical rate of use in the background corpus. Except in very small samples, such an Overrepresentation score would generally indicate a moderate degree of association between the word and the stimulus or question leading to the production of the natural language response. The word is, therefore, rather particular to the group and the stimulus. If the word is also relatively frequent, it is likely a popular means of expressing an idea that is central to the themes in the responses.
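The ratio described above can be sketched in a few lines of Python. This is an illustration only, not the Raven's Eye algorithm: the whitespace tokenization, the function name overrepresentation_scores, and the small background table of corpus proportions are all assumptions made for the example.

```python
from collections import Counter

def overrepresentation_scores(responses, background_proportions):
    """Ratio of each word's proportion in the responses to its background proportion.

    Hypothetical helper: tokenization and the background table are assumptions.
    """
    words = [w for text in responses for w in text.lower().split()]
    total = len(words)
    counts = Counter(words)
    scores = {}
    for word, count in counts.items():
        background = background_proportions.get(word)
        if background:  # skip words absent from the assumed background table
            scores[word] = (count / total) / background
    return scores

# Toy data: "battery" makes up 2 of 8 words (25%) but only 2.5% of the background.
responses = ["the battery died fast", "the battery is weak"]
background = {"the": 0.05, "battery": 0.025, "is": 0.01}
scores = overrepresentation_scores(responses, background)
```

In this toy data, "battery" occurs at roughly 10 times its assumed background rate, which corresponds to the Overrepresentation score of 10 used in the example above.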
Estimating confidence in already acquired datasets. The Overrepresentation score is also an estimate of confidence in the ability to make general claims about the word's particularity to the stimulus (or survey question) and population. All else being equal, if the sample producing the Overrepresentation score of 10 in the previous example consisted of 500 responses, then for such a result to reflect chance alone (and not a particular relationship to the stimulus), you would need to acquire an additional 4,500 responses, none of which mentions that word even once, before the word's proportion fell back to its background rate. This figure is the Overrepresentation score minus one, multiplied by the sample size: (10 - 1) x 500 = 4,500. If the word appears in the most frequent 50% of your data and your sample is of sufficient size, it is highly improbable that such an Overrepresentation is produced by chance alone. Instead, you can be confident that the word is predictably related to your stimulus and sample.
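The arithmetic in the preceding paragraph can be written as a one-line function. The sketch below assumes that responses are of comparable length, so that adding responses dilutes a word's proportion in direct relation to their number; the function name is hypothetical.

```python
def silent_responses_to_reach_background(score, sample_size):
    """Additional responses, none mentioning the word, needed before the
    word's proportion falls back to its background rate (a score of 1).

    Assumes all responses are of comparable length.
    """
    return (score - 1) * sample_size

# The example from the text: a score of 10 from 500 responses.
extra = silent_responses_to_reach_background(10, 500)  # 4500
```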
Calculating sample size needs prior to acquiring data. A sample size can be calculated before collecting data. It depends on the size of the population to which you intend to generalize your results, the Overrepresentation score threshold that you decide is sufficient to warrant a word's inclusion in your results, and the amount of confidence you require in your results. Continuing the example, suppose that you have a population of 10,000 people, and want to be sure that words with Overrepresentation scores of 10 or greater in your sample can be confidently applied to it, such that even if no participant outside your sample ever mentions the word*, the word still remains overrepresented at 1.5 times its typical background rate. Being 50% more likely to be expressed in response to your stimulus than in your selected language corpus generally, this word would then still be somewhat particular to your stimulus. To have this degree of confidence in your results, you would need a sample of 1,500 cases (10,000 / 10 x 1.5 = 1,500).
Those words with Overrepresentation scores higher than your selected threshold of 10 would of course also remain proportionally overrepresented, even if they were also never mentioned again in any of the subsequent responses*. In our example of a population of 10,000 people and an Overrepresentation threshold of 10:1.5 (provided no one else ever mentions the overrepresented words again), a word with an original Overrepresentation score of 20 would remain overrepresented at 3 times its rate in the background corpus. Similarly, a word with an Overrepresentation score of 100 would remain overrepresented at 15 times its background rate.
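The sample size calculation and the residual scores in the two preceding paragraphs can be reproduced with the short sketch below. The function names and parameters are hypothetical, and the sketch makes the same worst-case assumption as the text: no one outside the sample ever mentions the word, and all responses are of comparable length.

```python
import math

def required_sample_size(population, threshold_score, floor_score):
    """Cases needed so that a word scoring threshold_score in the sample stays
    at floor_score times its background rate across the whole population."""
    return math.ceil(population / threshold_score * floor_score)

def residual_score(score, sample_size, population):
    """Score remaining once the sample is diluted by the rest of the
    population, none of whom mention the word."""
    return score * sample_size / population

n = required_sample_size(10_000, 10, 1.5)   # 1500
at_20 = residual_score(20, n, 10_000)       # 3.0
at_100 = residual_score(100, n, 10_000)     # 15.0
```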
Provided a discrete stimulus (such as an open-ended survey or interview question on one topic that takes a few paragraphs of text or less to answer), the algorithm produces Overrepresentation scores that often range from the 50s to the 1,000s for words related to the stimulus. For such stimuli, then, sample sizes of around 100 cases may produce results that can be confidently generalized to populations of 10,000 and more. For comparison, were the stimulus posed as a binary (yes/no) question and one wanted 95% confidence at a +/- 5% interval, 370 cases would be required to generalize results to a population of 10,000 people (and approximately 2,100 cases would be required for 99% confidence at a +/- 2.5% interval).
*As noted previously, this would be highly unlikely if the word is in the most frequent 50% of your data.
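For readers who wish to check the binary (yes/no) comparison figures given above, the sketch below uses the standard sample size formula for a proportion with a finite population correction. That these particular figures were derived with this exact formula is an assumption; it does, however, reproduce the 370 cases quoted for 95% confidence at +/- 5%, and yields roughly 2,100 cases for 99% confidence at +/- 2.5%. The function name and the conservative default p = 0.5 are choices made for this example.

```python
import math
from statistics import NormalDist

def binary_sample_size(population, confidence, margin, p=0.5):
    """Sample size for estimating a proportion, with finite population correction.

    p = 0.5 is the most conservative choice (maximum variance).
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Sample size for an effectively infinite population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Shrink for a finite population.
    return math.ceil(n0 / (1 + (n0 - 1) / population))

n_95 = binary_sample_size(10_000, 0.95, 0.05)   # 370
n_99 = binary_sample_size(10_000, 0.99, 0.025)  # roughly 2,100
```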
Describing your attitude.
We encourage you to explicitly describe in writing the specific concepts, processes, and contexts involved during the acknowledgement and refinement of your attitude. Doing so not only helps others to understand the peoples, purposes, and procedures involved in your project, and therefore the scope of your results, but also serves as a source for later complementary investigation.